Abstract

We present the first worst-case linear-time algorithm to compute the Lempel-Ziv 78 factorization
of a given string over an integer alphabet. Our algorithm is based on nearest marked
ancestor queries on the suffix tree of the given string. We also show that the same technique 
can be used to construct the position heap of a set of strings in worst-case linear time, when 
the set of strings is given as a trie. 


1 Introduction 

Lempel-Ziv 78 (LZ78, in short) is a well known compression algorithm [19]. LZ78 compresses a
given text based on a dynamic dictionary which is constructed by partitioning the input string,
the process of which is called LZ78 factorization. Other than its obvious use for compression,
the LZ78 factorization is an important concept used in various string processing algorithms and
applications.

In this paper, we show an LZ78 factorization algorithm which runs in $O(n)$ time using $O(n)$
working space for an integer alphabet, where $n$ is the length of a given string and $m$ is the size of
the LZ78 factorization. Our algorithm does not make use of any randomization such as hashing,
and works in $O(n)$ time in the worst case. To our knowledge, this is the first $O(n)$-time LZ78
factorization algorithm when the size of an integer alphabet is $O(n)$ and
$2^{\omega(\log n \cdot \log\log\log n / (\log\log n)^2)}$. Our
algorithm computes the LZ78 trie (a trie representing the LZ78 factors) via the suffix tree [17]
annotated with a semi-dynamic nearest marked ancestor data structure.

We also show that the same idea can be used to construct the position heap [7] of a set of
strings which is given as a trie, and present an $O(\ell)$-time algorithm to construct it, where $\ell$ is the
size of the given trie.

Some of the results of this paper appeared in preliminary versions.

Comparison to previous work 

The LZ78 trie (and hence the LZ78 factorization) of a string of length $n$ can be computed in $O(n)$
expected time and $O(m)$ space, if hashing is used for maintaining the branching nodes of the LZ78
trie [19]. In this paper, we focus on algorithms without randomization, and we are interested
in the worst-case behavior of LZ78 factorization algorithms. If balanced binary search trees are
used in place of hashing, then the LZ78 trie can be computed in $O(n \log \sigma)$ worst-case time and
$O(m)$ working space. Our $O(n)$-time algorithm is faster than this method when $\sigma \in \omega(1)$ and
$\sigma \in O(n)$. On the other hand, our algorithm uses $O(n)$ working space, which can be larger
than $O(m)$ when the string is LZ78 compressible. Jansson et al. [12] proposed an algorithm which
computes the LZ78 trie of a given string in $O(n (\log\log n)^2 / (\log_\sigma n \log\log\log n))$ worst-case time,
using $O(n (\log \sigma + \log\log_\sigma n) / \log_\sigma n)$ bits of working space. Our $O(n)$-time algorithm is faster than
theirs when $\sigma \in 2^{\omega(\log n \cdot \log\log\log n / (\log\log n)^2)}$ and $\sigma \in O(n)$, and is as space-efficient as theirs when $\sigma \in \Theta(n)$.
Tamakoshi et al. [16] proposed an algorithm which computes the LZ78 trie in $O(n + (s + m) \log \sigma)$
worst-case time and $O(m)$ working space, where $s$ is the size of the run length encoding (RLE) of
a given string. Our $O(n)$-time algorithm is faster than theirs when $\sigma \in 2^{\omega(n/(s+m))}$ and $\sigma \in O(n)$.

The position heap of a single string of length $n$ over an alphabet of size $\sigma$ can be computed in
$O(n \log \sigma)$ worst-case time and $O(n)$ space [7], if the branches in the position heap are maintained
by balanced binary search trees. Independently of this present work, Gagie et al. showed that
the position heap of a given string of length $n$ over an integer alphabet can be computed in $O(n)$
time and $O(n)$ space, via the suffix tree of the string.

2 Preliminaries 

2.1 Notations on strings 

We consider a string $w$ of length $n$ over integer alphabet $\Sigma = \{1, \ldots, \sigma\}$, where $\sigma \in O(n)$. The
length of $w$ is denoted by $|w|$, namely, $|w| = n$. The empty string $\varepsilon$ is a string of length 0, namely,
$|\varepsilon| = 0$. For a string $w = xyz$, $x$, $y$ and $z$ are called a prefix, substring, and suffix of $w$, respectively.
The set of suffixes of a string $w$ is denoted by Suffix(w). The $i$-th character of a string $w$ is denoted
by $w[i]$ for $1 \le i \le n$, and the substring of a string $w$ that begins at position $i$ and ends at position
$j$ is denoted by $w[i..j]$ for $1 \le i \le j \le n$. For convenience, let $w[i..j] = \varepsilon$ if $j < i$. For any string
$w$, let $w^R$ denote the reversed string of $w$, i.e., $w^R = w[n] w[n-1] \cdots w[1]$.

2.2 Suffix Trees 

We give the definition of a very important and well known string index structure, the suffix tree. 
To assure property 4 below for the sake of presentation, we assume that string $w$ ends with a
unique character that does not occur elsewhere in $w$.

Definition 1 (Suffix Trees [17]). For any string $w$, its suffix tree, denoted STree(w), is a labeled
rooted tree which satisfies the following:

1. each edge is labeled with a non-empty substring of $w$;

2. each internal node has at least two children;

3. the labels $x$ and $y$ of any two distinct out-going edges from the same node begin with different
symbols in $\Sigma$;

4. there is a one-to-one correspondence between the suffixes of $w$ and the leaves of STree(w),
i.e., every suffix is spelled out by a unique path from the root to a leaf.

Since any substring of $w$ is a prefix of some suffix of $w$, all substrings of $w$ can be represented
as a path from the root in STree(w). For any node $v$, let str(v) denote the string which is a
concatenation of the edge labels from the root to $v$. A locus of a substring $x$ of $w$ in STree(w) is a
pair $(v, \gamma)$ of a node $v$ and a (possibly empty) string $\gamma$, such that $str(v)\gamma = x$ and $\gamma$ is the shortest.
A locus is said to be an explicit node if $\gamma = \varepsilon$, and is said to be an implicit node otherwise. It is
well known that STree(w) can be represented with $O(n)$ space, by representing each edge label $x$
with a pair $(i, j)$ of positions satisfying $x = w[i..j]$.

Theorem 1 ([8]). Given a string $w$ of length $n$ over an integer alphabet, STree(w) can be computed
in $O(n)$ time.

Figure 1: CST(W) for W = {aaba$, bbba$, ababa$, aabba$, babba$}. Each node u is associated
with id(u).

2.3 Suffix Trees of multiple strings 

A generalized suffix tree of a set of strings is the suffix tree that contains all suffixes of all the 
strings in the set. Generalized suffix trees for a set $W = \{w_1\$, \ldots, w_k\$\}$ of strings over an integer
alphabet can be constructed in linear time in the total length of the strings.

Suppose that the set W of strings is given as a reversed trie called a common-suffix trie, which 
is defined as follows. 

Definition 2 (Common-suffix tries). The common-suffix trie of a set $W$ of strings, denoted
CST(W), is a reversed trie such that

1. each edge is labeled with a character in $\Sigma$;

2. any two in-coming edges of any node are labeled with distinct characters;

3. each node $v$ represents the string that is a concatenation of the edge labels in the path from
$v$ to the root;

4. for each string $w \in W$ there exists a unique leaf which represents $w$.

An example of CST(W) is illustrated in Figure 1.

Let $\ell$ be the number of nodes in CST(W), and let Suffix(W) be the set of suffixes of the
strings in $W$, i.e., $\mathit{Suffix}(W) = \bigcup_{w \in W} \mathit{Suffix}(w)$. Clearly, $\ell$ equals the cardinality of Suffix(W)
(including the empty string). Hence, CST(W) is a natural representation of the set Suffix(W). If
$L$ is the total length of strings in $W$, then $\ell \le L + 1$ holds. On the other hand, when the strings
in $W$ share many suffixes, then $L = \Theta(\ell^2)$ (e.g., consider the set of strings $\{ab^i \mid 1 \le i \le \ell\}$).
Therefore, CST(W) can be regarded as a compact representation of the set $W$ of strings.
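The gap between $\ell$ and $L$ can be checked on small inputs. The following is a minimal Python sketch, not taken from the paper: the function name `build_cst` and the nested-dictionary representation are assumptions made only for illustration. It inserts each string in reverse into a trie, so that every node stands for one distinct suffix.

```python
def build_cst(strings):
    """Build a common-suffix trie as nested dicts and return (root, node_count).

    Hypothetical helper for illustration: each string is inserted from its last
    character to its first, so a node at depth d represents a suffix of length d.
    """
    root = {}
    node_count = 1  # the root represents the empty string
    for s in strings:
        node = root
        for c in reversed(s):
            if c not in node:
                node[c] = {}
                node_count += 1
            node = node[c]
    return root, node_count


# The set W used in the figures of this paper.
figure_w = ["aaba$", "bbba$", "ababa$", "aabba$", "babba$"]
_, ell = build_cst(figure_w)
total_length = sum(len(s) for s in figure_w)
print(ell, total_length)  # 13 28

# Strings a b^i share all of their b-suffixes, so L grows quadratically in the trie size.
k = 50
shared = ["a" + "b" * i for i in range(1, k + 1)]
_, ell_shared = build_cst(shared)
print(ell_shared, sum(len(s) for s in shared))  # 101 1325
```

In the second example the trie has $2k + 1$ nodes while the total length is $k(k+1)/2 + k$, which is the quadratic behavior described above.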

Definition 3 (Suffix Trees for CST(W)). For any CST(W), its suffix tree, denoted STree(W),
is a labeled rooted tree which satisfies the following:

1. each edge is labeled with a non-empty string which is a concatenation of the edge labels of
CST(W);

2. each internal node has at least two children;

3. the labels $x$ and $y$ of any two distinct out-going edges from the same node begin with different
symbols in $\Sigma$;

4. there is a one-to-one correspondence between the internal nodes of CST(W) and the leaves
of STree(W), i.e., every string which is represented by a node in CST(W) is spelled out by
a unique path from the root to a leaf.

Notice that the suffix tree for CST(W) is identical to a generalized suffix tree of the set $W$ of
strings. If a given CST(W) is of size $\ell$, then the size of the suffix tree of CST(W) is $O(\ell)$.

We will use the following result in our algorithms. 
Figure 2: The LZ78 trie of abaabaaaabbaab$. Each node numbered $i$ represents the factor $f_i$
of the LZ78 factorization, where $f_i$ is the path label from the root to the node, e.g., $f_3$ = aa,
$f_7$ = aab.


Theorem 2 ([15]). Given CST(W) of size $\ell$ for a set $W$ of strings over an integer alphabet, the
generalized suffix tree of $W$ can be computed in $O(\ell)$ time.

2.4 Tools on trees 

We will use the following efficient data structures on rooted trees. 

Lemma 1 (Nearest marked ancestor [18, 1]). A semi-dynamic rooted tree can be maintained in
linear space so that the following operations are supported in amortized $O(1)$ time: 1) find the
nearest marked ancestor of any node; 2) insert an unmarked node; 3) mark an unmarked node.

By “semi-dynamic” above we mean that no nodes are to be deleted from the tree. 
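To make the three operations concrete, here is a naive Python sketch of the interface. It is an illustration only and does not achieve the amortized $O(1)$ bound of Lemma 1: a query simply walks up parent pointers. The class name `MarkedTree`, its method names, and the convention that the root is always marked are assumptions introduced for this example.

```python
class MarkedTree:
    """Semi-dynamic rooted tree with marks; queries walk up parent pointers.

    Naive stand-in for the data structure of Lemma 1: a query costs time
    proportional to the distance to the answer, not amortized O(1).
    """

    def __init__(self):
        self.parent = [None]  # node 0 is the root
        self.marked = [True]  # assumption: the root is marked, so queries always succeed

    def add_leaf(self, parent):
        """Insert a new unmarked node as a child of `parent`."""
        self.parent.append(parent)
        self.marked.append(False)
        return len(self.parent) - 1

    def insert_above(self, child):
        """Insert a new unmarked node on the edge between `child` and its parent."""
        new_node = len(self.parent)
        self.parent.append(self.parent[child])
        self.marked.append(False)
        self.parent[child] = new_node
        return new_node

    def mark(self, v):
        self.marked[v] = True

    def nearest_marked_ancestor(self, v):
        """Return the closest marked node on the path from v to the root (v included)."""
        while not self.marked[v]:
            v = self.parent[v]
        return v


tree = MarkedTree()
a = tree.add_leaf(0)
b = tree.add_leaf(a)
c = tree.add_leaf(b)
tree.mark(a)
print(tree.nearest_marked_ancestor(c) == a)  # True
```

No operation removes a node, which matches the semi-dynamic setting: nodes are only inserted and marked.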

Lemma 2 (Level ancestor query). Given a static rooted tree, we can preprocess the tree in
linear time and space so that the $i$th node in the path from any node to the root can be found in
$O(1)$ time for any integer $i \ge 0$, if such exists.
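As an illustration of what a level ancestor query returns, the following Python sketch uses jump pointers (binary lifting). This is a simpler substitute, not the structure of Lemma 2: it needs $O(n \log n)$ preprocessing and answers a query in $O(\log n)$ time. The function names and the convention that $i = 0$ denotes the node itself are assumptions of this sketch.

```python
def preprocess_jump_pointers(parent):
    """parent[v] is the parent of node v, and the root has parent -1.

    Returns a table `up` where up[k][v] is the 2^k-th ancestor of v, or -1.
    """
    n = len(parent)
    levels = max(1, n.bit_length())
    up = [list(parent)]
    for _ in range(1, levels):
        prev = up[-1]
        up.append([-1 if prev[v] == -1 else prev[prev[v]] for v in range(n)])
    return up


def level_ancestor(up, v, i):
    """Return the i-th node on the path from v to the root (i = 0 is v), or None."""
    k = 0
    while i > 0 and v != -1:
        if i & 1:
            if k >= len(up):
                return None  # i exceeds the height of any node
            v = up[k][v]
        i >>= 1
        k += 1
    return None if v == -1 else v


# Small tree: 0 is the root; 1 and 2 are children of 0; 3 and 4 of 1; 5 of 3.
parent = [-1, 0, 0, 1, 1, 3]
up = preprocess_jump_pointers(parent)
print(level_ancestor(up, 5, 2))  # 1
```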

3 Algorithms 

In this section, we propose algorithms to compute LZ78 factorizations [19] and position heaps [7]
which run in linear time for an integer alphabet.

3.1 Computing LZ78 trie from suffix tree 

The LZ78 factorization [19] of a string $w$ is a sequence $f_1, \ldots, f_m$ of non-empty substrings of
$w$, where $f_1 \cdots f_m = w$, and each $f_i$ is the longest prefix of $w[|f_1 \cdots f_{i-1}| + 1..n]$ such that
$f_i \in \{f_j c \mid 1 \le j < i, c \in \Sigma\} \cup \Sigma$. Each $f_i$ is called an LZ78 factor of $w$. The dictionary of LZ78
factors of a string $w$ can be represented by the following trie, called the LZ78 trie of $w$.

Definition 4. The LZ78 trie of string $w$, denoted LZ78Trie(w), is a rooted tree such that each
node represents an LZ78 factor $f_i$, and there is an edge $(f_j, f_i)$ with label $c \in \Sigma$ iff $f_i = f_j c$.

See Figure 2 for an example of LZ78Trie(w). LZ78Trie(w) requires $O(m)$ space, where $m$ is
the number of factors in the LZ78 factorization of $w$. Each factor $f_i$ can be computed in $O(|f_i|)$
time from the trie, by starting from node $f_i$ and concatenating edge labels between $f_i$ and the
root. We compute LZ78Trie(w) as a compact representation of the LZ78 factorization of $w$.
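For reference, the trie and the recovery of a factor from its node can be sketched in Python as follows. This is the straightforward dictionary-based construction (hashing, hence expected rather than worst-case time), not the suffix-tree algorithm of this paper. The names `lz78_trie` and `factor_string` and the array layout are assumptions made for illustration.

```python
def lz78_trie(w):
    """Build the LZ78 trie of w with parent pointers; node 0 is the root.

    Returns (parent, label, factor_nodes), where label[v] is the character on
    the edge into v and factor_nodes[i - 1] is the node representing f_i.
    """
    parent, label, children = [None], [""], [{}]
    factor_nodes = []
    p = 0
    while p < len(w):
        v = 0
        # Follow existing edges as far as the remaining text allows.
        while p < len(w) and w[p] in children[v]:
            v = children[v][w[p]]
            p += 1
        if p < len(w):
            # Extend the matched factor by one character, creating a new node.
            u = len(parent)
            parent.append(v)
            label.append(w[p])
            children.append({})
            children[v][w[p]] = u
            v = u
            p += 1
        # If the text ran out first, the last factor repeats an existing node.
        factor_nodes.append(v)
    return parent, label, factor_nodes


def factor_string(parent, label, v):
    """Recover the factor of node v in time proportional to its length."""
    chars = []
    while v != 0:
        chars.append(label[v])
        v = parent[v]
    return "".join(reversed(chars))


parent, label, nodes = lz78_trie("abaabaaaabbaab$")
print([factor_string(parent, label, v) for v in nodes])
# ['a', 'b', 'aa', 'ba', 'aaa', 'bb', 'aab', '$']
```

Storing only a parent pointer and one character per node is what keeps the trie within $O(m)$ space in this sketch.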

We present an $O(n)$-time algorithm to compute the LZ78 trie of a string $w$ of length $n$ over an integer
alphabet, via the suffix tree of $w$. In so doing, we make use of the following key observation: since
LZ78Trie(w) is a trie whose nodes are all substrings of $w$, it can be superimposed on STree(w),
and be completely contained in it, with the exception that some nodes of the trie may correspond
to implicit nodes of the suffix tree. See Figure 3 for an example.
Figure 3: The LZ78-trie of string w = abaabaaaabbaab$, superimposed on the suffix tree of w.
The subtree consisting of the dark nodes is the LZ78-trie, derived from the LZ78-factorization:
a, b, aa, ba, aaa, bb, aab, $, of w.


Theorem 3. Given a string $w$ of length $n$ over an integer alphabet, LZ78Trie(w) can be
constructed in $O(n)$ time and $O(n)$ working space.

Proof. Suppose the LZ78 factorization $f_1 \cdots f_{i-1}$, up to position $p - 1 = |f_1 \cdots f_{i-1}|$ of a given
string $w$, has been computed, and the nodes of the LZ78 trie for $f_1, \ldots, f_{i-1}$ have been added
to STree(w). Now, we wish to calculate the $i$th LZ78 factor $f_i$ starting at position $p$. Let $z$
be the leaf of the suffix tree that corresponds to the suffix $w[p..n]$. The longest previous factor
$x$ that is a prefix of $w[p..n]$ corresponds to the longest path of the LZ78 trie built so far, which
represents a prefix of $w[p..n]$. If we consider the suffix tree as a semi-dynamic tree, where the nodes
corresponding to the superimposed LZ78 trie are dynamically inserted and marked, the node $x$ we
are looking for is the nearest marked ancestor of $z$, which can be computed in amortized $O(1)$ time
by Lemma 1. If $x$ is not branching, then we simply move down the edge by a single character (say $a$), create a new
node if necessary, and mark the node representing the $i$th LZ78 factor $f_i = xa$. If $x$ is branching,
then we can locate the out-going edge of $x$ that is in the path from $x$ to the leaf $z$ in $O(1)$ time
by a level ancestor query from $z$. Then we insert/mark the new node for the $i$th LZ78 factor $f_i$.
Technically, our suffix tree is semi-dynamic in that new nodes are created since the LZ78 trie is
superimposed. However, since we are only interested in level ancestor queries at branching nodes,
we only need to answer them for the original suffix tree. Therefore, we can preprocess the tree
in $O(n)$ time and space to answer the level ancestor queries in $O(1)$ time. Finally, we obtain the
LZ78 trie by removing the unmarked nodes from the provided suffix tree. □
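The marking argument of the proof can be simulated on a small input. The Python sketch below is a deliberately simplified illustration and not the linear-time algorithm: it uses an uncompacted suffix trie (quadratic size) in place of the suffix tree, and a plain walk up parent pointers in place of both the nearest marked ancestor structure and the level ancestor query. The function name is hypothetical, and the input is assumed to end with a unique character.

```python
def lz78_via_marked_ancestors(w):
    """Simulate the marking scheme on a suffix trie of w (assumes a unique last character)."""
    parent, depth, children, marked = [-1], [0], [{}], [True]  # root is marked
    leaf_of = []  # leaf_of[p] is the node spelling the suffix w[p:]
    for p in range(len(w)):
        v = 0
        for c in w[p:]:
            u = children[v].get(c)
            if u is None:
                u = len(parent)
                parent.append(v)
                depth.append(depth[v] + 1)
                children.append({})
                marked.append(False)
                children[v][c] = u
            v = u
        leaf_of.append(v)

    factors = []
    p = 0
    while p < len(w):
        z = leaf_of[p]
        # Walk up from z to its nearest marked proper ancestor x, remembering
        # the child of x on the path; that child represents the new factor.
        child, x = z, parent[z]
        while not marked[x]:
            child, x = x, parent[x]
        marked[child] = True
        factors.append(w[p:p + depth[child]])
        p += depth[child]
    return factors


print(lz78_via_marked_ancestors("abaabaaaabbaab$"))
# ['a', 'b', 'aa', 'ba', 'aaa', 'bb', 'aab', '$']
```

After the loop, the marked nodes are exactly the root and the eight factors, which mirrors the last step of the proof: discarding the unmarked nodes leaves the LZ78 trie.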

3.2 Computing position heap from suffix tree 

Here, we show how to compute the position heap of a set of strings from the corresponding suffix 
tree. We begin with the definition of position heaps. 

Let $W = \{w_1\$, w_2\$, \ldots, w_k\$\}$ be a set of strings such that $w_i\$ \notin \mathit{Suffix}(w_j\$)$ for any $1 \le
i \ne j \le k$. Define the total order $\prec$ on $\Sigma^*$ by $x \prec y$ iff either $|x| < |y|$, or $|x| = |y|$ and $x^R$ is
lexicographically smaller than $y^R$. Let $\mathit{Suffix}_\prec(W)$ be the sequence of strings in Suffix(W) that
are ordered w.r.t. $\prec$, and let $\ell = |\mathit{Suffix}_\prec(W)|$. For any $1 \le i \le \ell$, let $s_i$ denote the $i$th suffix of
$\mathit{Suffix}_\prec(W)$.
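The order $\prec$ is easy to reproduce with a comparison key of (length, reversed string). The following Python sketch is an illustration only: the function name is hypothetical, and it relies on `$` comparing smaller than the letters in Python's default character order.

```python
def suffixes_in_order(strings):
    """Return the distinct suffixes of the strings, sorted by length and then
    by the lexicographic order of their reversals."""
    suffixes = {s[i:] for s in strings for i in range(len(s) + 1)}
    return sorted(suffixes, key=lambda x: (len(x), x[::-1]))


W = ["aaba$", "bbba$", "ababa$", "aabba$", "babba$"]
print(suffixes_in_order(W))
# ['', '$', 'a$', 'ba$', 'aba$', 'bba$', 'aaba$', 'baba$', 'abba$', 'bbba$',
#  'ababa$', 'aabba$', 'babba$']
```

For example, aba$ precedes bba$ because their reversals $aba and $abb first differ in the last character.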

Definition 5 (Position heaps for multiple strings). The position heap for a set $W$ of strings,
denoted PH(W), is the trie heap defined as follows:

1. each edge is labeled with a character in $\Sigma$;

2. any two out-going edges of any node are labeled with distinct characters;

3. the root node is labeled with 1 and the other nodes are labeled with integers from 1 to $\ell$ such
that parents' labels are smaller than their children's labels;

4. the path from the root to node labeled with $i$ ($1 \le i \le \ell$) is a prefix of $s_i$.

Figure 4: PH(W) for W = {aaba$, bbba$, ababa$, aabba$, babba$}, where Suffix_≺(W) = (ε, $,
a$, ba$, aba$, bba$, aaba$, baba$, abba$, bbba$, ababa$, aabba$, babba$). The node labeled
with integer i corresponds to s_i.

Notice that PH(W) can be obtained by inserting, into a trie, the strings in $\mathit{Suffix}_\prec(W)$ in
increasing order w.r.t. $\prec$. In this paper, we assume that CST(W) is given as input, but $\mathit{Suffix}_\prec(W)$
is not explicitly given. For each $s_i$ in $\mathit{Suffix}_\prec(W)$, let $id(s_i) = i$. We would like to know $id(s)$ for
all suffixes $s$ represented by CST(W), which gives us the ordering of strings in CST(W) w.r.t.
$\mathit{Suffix}_\prec(W)$.
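The insertion process just described can be sketched directly. The Python code below is a naive illustration (one trie walk per suffix, with hashing for the branches) and not the linear-time construction of this paper. The function name `position_heap` and the dictionary layout are assumptions, and the ordered suffix list is computed by plain sorting rather than from CST(W).

```python
def position_heap(ordered_suffixes):
    """Insert s_1, s_2, ... into a trie; node i is created while inserting s_i.

    Assumes ordered_suffixes[0] is the empty string, the strings are distinct
    and sorted by length. Returns (children, path), where path[i] is the
    string spelled from the root to node i.
    """
    children = {1: {}}  # node label -> {character: child label}
    path = {1: ""}
    for i, s in enumerate(ordered_suffixes[1:], start=2):
        v, d = 1, 0
        # Walk down along s_i as long as the trie has a matching edge.
        while d < len(s) and s[d] in children[v]:
            v = children[v][s[d]]
            d += 1
        # The new node represents the shortest prefix of s_i not yet in the trie.
        children[v][s[d]] = i
        children[i] = {}
        path[i] = s[:d + 1]
    return children, path


W = ["aaba$", "bbba$", "ababa$", "aabba$", "babba$"]
suffixes = sorted({s[i:] for s in W for i in range(len(s) + 1)},
                  key=lambda x: (len(x), x[::-1]))
children, path = position_heap(suffixes)
print(path[5], path[9], path[13])  # ab abb bab
```

Each node's label exceeds its parent's label because a node can only be attached below nodes created in earlier iterations.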

Lemma 3. $id(s)$ for all nodes $s$ in CST(W) can be computed in $O(\ell)$ time.

Proof. We first construct the suffix array of CST(W) in $O(\ell)$ time, using the algorithm proposed
by Ferragina et al. [3]. This gives us the lexicographical order of the suffixes represented by
CST(W). Second, we bucket-sort the nodes of CST(W): we use an array of size $x$ as buckets,
where $x \le \ell$ is the length of the longest string in CST(W). We then scan the suffix array from
the beginning to the end, and insert each node (string) $s$ into the $|s|$th bucket (entry) of the array.
This gives us $id(s)$ for all nodes $s$ in CST(W) in $O(\ell)$ time. □

For any $1 \le i \le \ell$, where $\ell$ is the number of nodes of CST(W), let $\mathit{CST}(W)^i$ denote the
subtree of CST(W) consisting only of the nodes $s_j$ with $1 \le j \le i$. $\mathit{PH}(W)^i$ is the position
heap for $\mathit{CST}(W)^i$ for each $1 \le i \le \ell$, and in our algorithm which follows, we construct PH(W)
incrementally, in increasing order of $i$.

We present an $O(\ell)$-time algorithm to compute PH(W) from the generalized suffix tree for
CST(W) with $\ell$ nodes. Since PH(W) is a trie where each node represents some substring of the
strings in $W$, it can be superimposed on the generalized suffix tree of $W$, which is equivalent to
the suffix tree of CST(W), and be completely contained in it, with the exception that some nodes
of the trie may correspond to implicit nodes of the suffix tree. See Figure 5 for an example of
PH(W) superimposed on the suffix tree of CST(W). We summarize our algorithm as follows.

Theorem 4. Given CST(W) with $\ell$ nodes representing a set $W$ of strings over an integer alphabet,
PH(W) can be constructed in $O(\ell)$ time and $O(\ell)$ working space.

Proof. Suppose we have computed the position heap $\mathit{PH}(W)^{i-1}$ superimposed onto the suffix tree
of CST(W), and we wish to find the next node which corresponds to suffix $s_i$. Let $z$ be the leaf
of the suffix tree that corresponds to the suffix $s_i$. The longest prefix of $s_i$ that is represented
by $\mathit{PH}(W)^{i-1}$ corresponds to the longest path of $\mathit{PH}(W)^{i-1}$ which represents a prefix of $s_i$.
Therefore, this can be found by a semi-dynamic nearest marked ancestor query, and the rest is
analogous to the algorithm to compute the LZ78 trie of Theorem 3. Finally, we obtain the position
heap by removing the unmarked nodes from the provided suffix tree. □

Figure 5: The position heap of set W = {aaba$, bbba$, ababa$, aabba$, babba$}, superimposed
on the generalized suffix tree of W, which is equivalent to the suffix tree of CST(W). The subtree
consisting of the dark nodes is the position heap of W.